feat(eval): episode sharding, parallel launcher, and autotune #3275
Closed
pkooij wants to merge 2 commits into feat/async-vector-env from
Conversation
pkooij force-pushed from b43f9ab to 1f7e7b4
Add lerobot-eval-parallel and lerobot-eval-autotune entry points for multi-process evaluation. A single H100 running 4 shards of SmolVLA achieves ~100% GPU utilisation vs ~0.5% with the serial baseline.

- EvalConfig: add shard_id / num_shards fields; validate ranges
- lerobot_eval.py: _shard_episodes() splits n_episodes round-robin; eval_main uses per-shard n_episodes + seed offset; writes shard_K_of_N.json when num_shards > 1
- lerobot_eval_parallel.py: spawns K subprocesses with disjoint shard IDs, sets MUJOCO_GL and OMP_NUM_THREADS, merges results on completion
- lerobot_eval_autotune.py: probes GPU VRAM, CPU cores, optional model footprint and env step time; derives optimal num_shards / batch_size / MUJOCO_GL; prints a paste-ready command
- pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_policy_all already supports running multiple task groups concurrently via ThreadPoolExecutor, but policy.reset() was not thread-safe: all threads shared the same policy object and its mutable state (action queues, temporal buffers).

Fix: each thread receives a shallow copy of the policy. copy.copy() creates a new Python object whose _parameters dict is a shared reference — same tensor storage, zero extra VRAM — while reset() rebinds per-episode state to fresh objects per thread.

Caveat: ACT with temporal_ensemble_coeff is not safe with this approach (its reset() mutates a shared sub-object). Keep max_parallel_tasks=1 for that config.

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises GPU utilization from ~20% to ~60-80% with no additional VRAM cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
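The shallow-copy pattern above can be illustrated with a minimal stand-in class (this is not the lerobot Policy API; the class and `run_task` helper are hypothetical, only the copy.copy()/reset() mechanics mirror the fix):

```python
import copy
from collections import deque
from concurrent.futures import ThreadPoolExecutor

class Policy:
    """Stand-in: shared weights plus per-episode mutable state."""
    def __init__(self):
        self._parameters = {"w": [0.0] * 4}  # shared weights (no per-thread copy)
        self._action_queue = deque()          # per-episode mutable state

    def reset(self):
        # Rebind (not mutate) so each shallow copy gets its own fresh queue.
        self._action_queue = deque()

def run_task(policy: Policy, task_id: int) -> bool:
    local = copy.copy(policy)  # new object; _parameters dict is still shared
    local.reset()              # fresh queue bound to this thread's copy only
    local._action_queue.append(task_id)
    # Weights alias the same dict; queues are now independent objects.
    return (local._parameters is policy._parameters
            and local._action_queue is not policy._action_queue)

policy = Policy()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(lambda t: run_task(policy, t), range(8)))
```

Because copy.copy() only duplicates the instance's attribute dict, the weight tensors are never duplicated, which is why the fix costs no extra VRAM; the caveat in the commit message arises when reset() mutates a shared sub-object instead of rebinding.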
pkooij force-pushed from b411838 to 66276f1
Title
feat(eval): episode sharding, parallel launcher, and autotune
Type / Scope
lerobot/scripts/lerobot_eval.py, lerobot/configs/default.py, new lerobot_eval_parallel.py, new lerobot_eval_autotune.py

Summary / Motivation
Even after PR #3274 fixes AsyncVectorEnv, a single eval process achieves only ~20% GPU utilisation (env step ~20 ms >> inference ~5 ms). The remaining idle time can be recovered by running multiple independent eval processes (shards), each handling a disjoint slice of episodes with its own model copy. On an H100 (80 GB), SmolVLA at fp16 (~14 GB) fits 4–5 times → 4 × 20% ≈ 80–100% GPU utilisation with zero networking or coordination overhead.

This PR adds:
- lerobot_eval.py: each process handles episodes shard_id, shard_id+N, ... with non-overlapping seeds.
- lerobot-eval-parallel: spawns K subprocesses, sets MUJOCO_GL and OMP_NUM_THREADS, merges results.
- lerobot-eval-autotune: probes GPU VRAM, CPU cores, model footprint, and env step time; outputs optimal num_shards / batch_size / MUJOCO_GL with a paste-ready command.

Related issues
What changed
- configs/default.py (EvalConfig): add shard_id: int = 0, num_shards: int = 1; validate ranges in __post_init__
- lerobot_eval.py: add _shard_episodes(n_episodes, shard_id, num_shards) → list[int]; eval_main computes per-shard episode count and seed offset; writes shard_K_of_N.json when num_shards > 1, else eval_info.json (default unchanged)
- lerobot_eval_parallel.py (new, ~120 LOC): parse --num-shards / --render-device; spawn K subprocesses; wait; merge shard JSON files into eval_info.json
- lerobot_eval_autotune.py (new, ~140 LOC): 8-step hardware probe → AutotuneRecommendation; main() prints summary + paste-ready command
- pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune entry points

Default behaviour is unchanged: num_shards=1 → exactly the same execution path as before.

How was this tested (or how to run locally)
Tests added:
- test_shard_assignment: _shard_episodes(100, 2, 5) == [2, 7, 12, ..., 97]
- test_shard_uneven: 103 episodes / 5 shards distributes without overlap or gap
- test_shard_no_overlap: union of all shards == full episode range

Single-machine parallel run:
Checklist (required before merge)
- pre-commit run -a
- pytest

Reviewer notes
- subprocess.Popen (fork+exec) gives each shard a clean Python interpreter and its own valid EGL/osmesa context — no stale GPU handles inherited from the parent.
- Shard K starts its seeds at seed + K * ceil(n_episodes / num_shards), so the combined run is equivalent to one serial run with the same seeds.
- --render-device auto: uses EGL (GPU) for 1 shard; switches to osmesa (CPU rendering, 0 VRAM) when multiple model copies would exhaust VRAM.
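The launcher pattern described in these notes can be sketched as below. This is an illustrative sketch, not the PR's code: the module path passed to the subprocess and the CLI flag spelling are assumptions, while the shard_K_of_N.json / eval_info.json file names, the MUJOCO_GL / OMP_NUM_THREADS env vars, and the seed-offset formula come from the PR text:

```python
import json
import math
import os
import subprocess
import sys
from pathlib import Path

def shard_seed_base(seed: int, shard_id: int, n_episodes: int, num_shards: int) -> int:
    """Seed offset per the reviewer notes: seed + K * ceil(n_episodes / num_shards)."""
    return seed + shard_id * math.ceil(n_episodes / num_shards)

def launch_shards(num_shards: int, out_dir: Path) -> None:
    """Spawn one clean subprocess per shard, wait, then merge shard JSONs."""
    env = {**os.environ, "MUJOCO_GL": "osmesa", "OMP_NUM_THREADS": "1"}
    procs = [
        subprocess.Popen(
            [sys.executable, "-m", "lerobot.scripts.lerobot_eval",  # hypothetical module path
             f"--eval.shard_id={k}", f"--eval.num_shards={num_shards}"],  # hypothetical flag spelling
            env=env,
        )
        for k in range(num_shards)
    ]
    for p in procs:
        p.wait()
    # Merge per-shard result files into a single eval_info.json.
    merged = [
        json.loads((out_dir / f"shard_{k}_of_{num_shards}.json").read_text())
        for k in range(num_shards)
    ]
    (out_dir / "eval_info.json").write_text(json.dumps(merged, indent=2))
```

Since each shard runs at most ceil(n_episodes / num_shards) episodes, consecutive shards' seed ranges never overlap, which is what makes the sharded run seed-equivalent to a serial one.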